Abstract

Boosted decision trees are applied to particle identification in the MiniBooNE 
experiment operated at Fermi National Accelerator Laboratory (Fermilab) for 
neutrino oscillations. Numerous attempts are made to tune the boosted decision 
trees, to compare performance of various boosting algorithms, and to select input 
variables for optimal performance. 



1 Introduction 

In High Energy Physics (HEP) experiments, one usually needs to select events of specific
interest, so-called signal events, out of numerous background events for study. To
increase the ratio of signal to background, one needs to suppress background events
while keeping high signal efficiency. To this end, some advanced techniques from
statistics and computer science, such as AdaBoost [1], ε-Boost [2], ε-LogitBoost [2],
ε-HingeBoost, Random Forests [3], etc., were introduced for signal and background event
separation in the MiniBooNE experiment [4] at Fermilab. The MiniBooNE experiment is
designed to confirm or refute the evidence for $\nu_\mu \to \nu_e$ oscillations at
$\Delta m^2 \sim 1\ \mathrm{eV}^2/c^4$ found by the LSND experiment [5]. It is a crucial
experiment which will imply new physics beyond the standard model if the LSND signal is
confirmed. These techniques are tuned with one sample of Monte Carlo (MC) events, the
training sample, and then tested with an independent MC sample, the testing sample.
Initial comparisons of these techniques with artificial neural networks (ANN) using the
MiniBooNE MC samples were described previously [6]. This work indicated that the
method of boosted decision trees is superior to the ANNs for Particle Identification
(PID) using the MiniBooNE MC samples. Further studies show that the boosted decision
tree method not only has better event separation, but is also more stable and
robust than ANNs when using MC samples with varying input parameters.

The boosting algorithm is one of the most powerful learning techniques introduced 
in the past decade [TJ, [HJ [TD| . The motivation for the boosting algorithm is to design a 
procedure that combines many "weak" classifiers to achieve a final powerful classifier. 
In the present work numerous trials are made to tune the boosted decision trees, and
comparisons are made for various algorithms. For a large number of discriminant 
variables, several techniques are described to select a set of powerful input variables in 
order to obtain optimal event separation using boosted decision trees. Furthermore, 
post-fitting of weights for the trained boosting trees is also investigated to attempt 
further possible improvement. 

This paper is focused on the tuning of boosting. All results appearing in this paper
are relative numbers. They do not represent the MiniBooNE PID performance;
that performance is continually improving with further algorithm and PID study. The
description of the MiniBooNE reconstruction packages [11, 12], the reconstructed
variables, the overall and absolute performance of the boosting PID [13, 14], and the
validation of the input variables and the boosting PID variables by comparing various MC
and real data samples [15] will be described in future articles.

2 Decision Trees 

Boosting algorithms can be applied to any classifier. Here they are applied to decision
trees. A schematic of a simple decision tree is shown in Figure 1, where S denotes
signal, B denotes background, and terminal nodes, called leaves, are shown in boxes. The key issue
is to define a criterion that describes the goodness of separation between signal and 
background in the tree split. Assume the events are weighted, with each event having
weight $w_i$. Define the purity of the sample in a node by

\[ P = \frac{\sum_s w_s}{\sum_s w_s + \sum_b w_b}, \]

where $\sum_s$ is the sum over signal events and $\sum_b$ is the sum over background events.
Note that $P(1 - P)$ is 0 if the sample is pure signal or pure background. For a given
node let

\[ \mathrm{Gini} = \Big( \sum_{i=1}^{n} w_i \Big) P(1 - P), \]

where n is the number of events on that node. The criterion chosen is to minimize 

\[ \mathrm{Gini}_{\text{left child}} + \mathrm{Gini}_{\text{right child}}. \]

To determine the increase in quality when a node is split into two nodes, one 
maximizes 

\[ \mathrm{Criterion} = \mathrm{Gini}_{\text{father}} - \mathrm{Gini}_{\text{left child}} - \mathrm{Gini}_{\text{right child}}. \]

At the end, if a leaf has purity greater than 1/2 (or whatever is set), then it is called 
a signal leaf, otherwise, a background leaf. Events are classified signal (have score of 
1) if they land on a signal leaf and background (have score of -1) if they land on a 
background leaf. The resulting tree is a decision tree. 
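
The splitting criterion described in this section can be illustrated with a minimal Python sketch. The function names (`purity`, `gini`, `split_criterion`, `best_cut`) and the choice of scanning midpoints between distinct values of a single variable are assumptions made for illustration only; they are not taken from the MiniBooNE code.

```python
def purity(weights, labels):
    """Weighted signal purity P of a node; labels are +1 (signal) or -1 (background)."""
    total = sum(weights)
    if total == 0:
        return 0.0
    signal = sum(w for w, y in zip(weights, labels) if y == 1)
    return signal / total


def gini(weights, labels):
    """Gini = (sum of weights) * P * (1 - P); zero for a pure or empty node."""
    p = purity(weights, labels)
    return sum(weights) * p * (1.0 - p)


def split_criterion(values, weights, labels, cut):
    """Gini_father - Gini_left - Gini_right for the split 'value < cut'."""
    left = [(w, y) for x, w, y in zip(values, weights, labels) if x < cut]
    right = [(w, y) for x, w, y in zip(values, weights, labels) if x >= cut]
    g_left = gini([w for w, _ in left], [y for _, y in left])
    g_right = gini([w for w, _ in right], [y for _, y in right])
    return gini(weights, labels) - g_left - g_right


def best_cut(values, weights, labels):
    """Scan midpoints between sorted distinct values (illustrative choice).

    Returns (cut, criterion) for the cut maximizing the criterion.
    Needs at least two distinct values, otherwise max() raises ValueError.
    """
    points = sorted(set(values))
    cuts = [(a + b) / 2.0 for a, b in zip(points, points[1:])]
    scored = [(c, split_criterion(values, weights, labels, c)) for c in cuts]
    return max(scored, key=lambda pair: pair[1])
```

For four equally weighted events with values 1, 2, 3, 4, of which the first two are signal, the father node has Gini = 0.25 and the cut at 2.5 produces two pure children, so the criterion reaches its maximum possible value of 0.25 there.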

Decision trees have been available for some time [7]. They are known to be powerful
but unstable, i.e., a small change in the training sample can produce a large change in
the tree and the results. Combining many decision trees to make a "majority vote", as
in the random forests method, can improve the stability somewhat. However, as will be
discussed in Section 6, the performance of the random forests method is significantly
worse than the performance of the boosted decision tree method in which the weights
of misclassified events are boosted for succeeding trees.

Figure 1: Schematic of a decision tree.

3 Some Boosting Algorithms 

If there are N total events in the sample, the weight of each event is initially taken as 
1/N. Suppose that there are M trees and m is the index of an individual tree. Let 

• $x_i$ = the set of PID variables for the ith event.

• $y_i = 1$ if the ith event is a signal event and $y_i = -1$ if the event is a background
event.

• $w_i$ = the weight of the ith event.

• $T_m(x_i) = 1$ if the set of variables for the ith event lands that event on a signal leaf
and $T_m(x_i) = -1$ if the set of variables for that event lands it on a background
leaf.

• $I(y_i \ne T_m(x_i)) = 1$ if $y_i \ne T_m(x_i)$ and 0 if $y_i = T_m(x_i)$.

There are several commonly used algorithms for boosting the weights of the misclassified
events in the training sample. The boosting performance differs considerably depending
on how the event weights are updated.






3.1 AdaBoost 

The first boosting method is called "AdaBoost" [1], or sometimes discrete AdaBoost.
Define for the mth tree:

\[ \mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I(y_i \ne T_m(x_i))}{\sum_{i=1}^{N} w_i}. \]

Calculate:

\[ \alpha_m = \beta \times \ln\big( (1 - \mathrm{err}_m) / \mathrm{err}_m \big). \]

$\beta = 1$ is the value used in the standard AdaBoost method. Change the weight of each
event i, i = 1, ..., N:

\[ w_i \to w_i \times e^{\alpha_m I(y_i \ne T_m(x_i))}. \]

Renormalize the weights:

\[ w_i \to \frac{w_i}{\sum_{i=1}^{N} w_i}. \]

The score for a given event is

\[ T(x) = \sum_{m=1}^{M} \alpha_m T_m(x), \]

which is just the weighted sum of the scores of the individual trees.
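
As an illustration, one AdaBoost iteration as defined above can be sketched in a few lines of Python. The function names and the decision to raise an error when the weighted error rate is exactly 0 or 1 (where the logarithm is undefined) are assumptions of this sketch, not part of the original analysis code.

```python
import math


def adaboost_step(weights, labels, tree_outputs, beta=1.0):
    """One AdaBoost iteration: return (alpha_m, renormalized event weights).

    labels and tree_outputs hold +1 (signal) or -1 (background) per event.
    """
    miss = [1.0 if y != t else 0.0 for y, t in zip(labels, tree_outputs)]
    err = sum(w * m for w, m in zip(weights, miss)) / sum(weights)
    if not 0.0 < err < 1.0:
        # Assumption of this sketch: refuse err_m = 0 or 1, where ln() diverges.
        raise ValueError("err_m must lie strictly between 0 and 1")
    alpha = beta * math.log((1.0 - err) / err)
    # Only misclassified events (miss = 1) have their weights increased.
    boosted = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]
    total = sum(boosted)
    return alpha, [w / total for w in boosted]


def adaboost_score(alphas, outputs_for_event):
    """Weighted sum of the +1/-1 outputs of all trees for a single event."""
    return sum(a * t for a, t in zip(alphas, outputs_for_event))
```

With four events of weight 1/4 and one misclassified event, the error rate is 0.25, so with $\beta = 1$ the tree weight is ln 3 and the misclassified event carries half of the total weight after renormalization.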

3.2 ε-Boost

A second boosting method is called "ε-Boost" [2], or sometimes "shrinkage". After the
mth tree, change the weight of each event i, i = 1, ..., N:

\[ w_i \to w_i \, e^{2 \epsilon I(y_i \ne T_m(x_i))}, \]

where $\epsilon$ is a constant of the order of 0.01. Renormalize the weights:

\[ w_i \to \frac{w_i}{\sum_{i=1}^{N} w_i}. \]

The score for a given event is

\[ T(x) = \sum_{m=1}^{M} \epsilon \, T_m(x), \]

which is the renormalized, but unweighted, sum of the scores over individual trees.
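
For comparison with AdaBoost, a minimal sketch of the ε-Boost update follows. It assumes the event score is the sum of the tree outputs scaled by ε, as suggested by the description of the score as a renormalized but unweighted sum; the function names are illustrative.

```python
import math


def epsilon_boost_step(weights, labels, tree_outputs, epsilon=0.01):
    """Multiply the weight of each misclassified event by e^(2*epsilon), then renormalize."""
    factor = math.exp(2.0 * epsilon)
    boosted = [w * factor if y != t else w
               for w, y, t in zip(weights, labels, tree_outputs)]
    total = sum(boosted)
    return [w / total for w in boosted]


def epsilon_boost_score(outputs_for_event, epsilon=0.01):
    """Unweighted sum of the +1/-1 tree outputs for one event, scaled by epsilon."""
    return epsilon * sum(outputs_for_event)
```

Unlike AdaBoost, the boost factor here does not depend on the error rate of the tree, so every misclassified event is boosted by the same fixed amount at every iteration.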

3.3 ε-LogitBoost

A third boosting method is called "ε-LogitBoost". This method is quite similar to
ε-Boost, but the weights are updated according to:

\[ w_i = \frac{e^{-y_i T(x_i)}}{1 + e^{-y_i T(x_i)}}, \]

where the score is updated as $T(x) \to T(x) + \epsilon \times T_m(x)$ for the mth tree
iteration.

3.4 ε-HingeBoost

A fourth boosting method is called "ε-HingeBoost". Again this method is quite similar
to ε-Boost, but here the weights are updated according to:

\[ w_i = 1 \ \text{if}\ y_i T(x_i) < 1; \qquad w_i = 0 \ \text{if}\ y_i T(x_i) \ge 1, \]

where the score is updated as $T(x) \to T(x) + \epsilon \times T_m(x)$ for the mth tree
iteration.
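
The two weight rules of Sections 3.3 and 3.4 differ only in how the weight depends on the margin $y_i T(x_i)$. The sketch below makes this explicit; the function names are illustrative, and the boundary case $y_i T(x_i) = 1$ is assigned zero weight as an assumption.

```python
import math


def update_score(score, tree_output, epsilon=0.01):
    """Running score update T(x) -> T(x) + epsilon * T_m(x)."""
    return score + epsilon * tree_output


def logit_weight(y, score):
    """epsilon-LogitBoost weight e^(-yT) / (1 + e^(-yT)), written as 1 / (1 + e^(yT)).

    Note: math.exp overflows for margins above roughly 700, far beyond the
    range reached with epsilon ~ 0.01 and a few thousand trees.
    """
    return 1.0 / (1.0 + math.exp(y * score))


def hinge_weight(y, score):
    """epsilon-HingeBoost weight: 1 while the margin y*T is below 1, else 0."""
    return 1.0 if y * score < 1.0 else 0.0
```

In both cases events with a large positive margin (confidently and correctly classified) receive little or no weight, so later trees concentrate on events near or on the wrong side of the decision boundary.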

3.5 LogitBoost 

A fifth boosting method is called "LogitBoost" [2]. Let $y^*_i = 1$ for signal events and
$y^*_i = 0$ for background events. Initial probability estimates are set to $p(x_i) = 0.5$ for
event i, where $x_i$ is the set of PID variables. Let:

\[ z_i = \frac{y^*_i - p(x_i)}{p(x_i)\,(1 - p(x_i))}, \]

\[ w_i = p(x_i)\,(1 - p(x_i)), \]

where $w_i$ is the weight of event i. Let $\bar{z}$ be the weighted average of z over some set
of events. Instead of the Gini criterion, the splitting variable and point to divide the
events at a node into two nodes L and R are determined to minimize

\[ \sum_{L} w_i (z_i - \bar{z}_L)^2 + \sum_{R} w_i (z_i - \bar{z}_R)^2. \]

The output for tree m is $T^*_m(x_i) = \bar{z}$ for the node onto which event i falls. The total
output score is:

\[ T(x) = \sum_{m=1}^{M} \frac{1}{2} T^*_m(x). \]

The probability is updated according to:

\[ p(x) = \frac{e^{T(x)}}{e^{T(x)} + e^{-T(x)}}. \]
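
The LogitBoost quantities defined in this section can be written out as a short sketch. The helper names are illustrative, and the split loss is evaluated for a given partition of the events rather than searched over cuts.

```python
import math


def working_response(y_star, p):
    """z_i = (y*_i - p) / (p (1 - p)), with y* = 1 (signal) or 0 (background)."""
    return (y_star - p) / (p * (1.0 - p))


def logit_event_weight(p):
    """w_i = p (1 - p)."""
    return p * (1.0 - p)


def weighted_mean(z_values, weights):
    """Weighted average of z over a set of events (used as the node output)."""
    return sum(w * z for w, z in zip(weights, z_values)) / sum(weights)


def split_loss(z_left, w_left, z_right, w_right):
    """Sum over L and R of w_i (z_i - zbar)^2, the quantity minimized at a split."""
    loss = 0.0
    for z_vals, w_vals in ((z_left, w_left), (z_right, w_right)):
        zbar = weighted_mean(z_vals, w_vals)
        loss += sum(w * (z - zbar) ** 2 for w, z in zip(w_vals, z_vals))
    return loss


def probability(score):
    """p(x) = e^T / (e^T + e^-T)."""
    return math.exp(score) / (math.exp(score) + math.exp(-score))
```

With the initial estimate p = 0.5, every signal event starts with z = 2 and every background event with z = -2, all with weight 0.25, so a split that separates signal from background perfectly has zero loss.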

3.6 Gentle AdaBoost 

A sixth boosting method is called "Gentle AdaBoost" [2]. It uses the same criterion as
described for LogitBoost. Here $z_i$ is the same as $y_i$. The weights are updated according
to:

\[ w_i \to w_i \, e^{-(2 p_m(x_i) - 1)} \]

for signal events and

\[ w_i \to w_i \, e^{+(2 p_m(x_i) - 1)} \]

for background events, where $p_m(x_i)$ is the weighted purity of the leaf on which event
i falls.

Figure 2: Tuning $\beta$ (left) and decision tree size (right) using AdaBoost.



3.7 Real AdaBoost 

A seventh boosting method is called "Real AdaBoost" [2]. It is similar to the discrete
version of AdaBoost described in Section 3.1, but the weights and event scores are
calculated in different ways. The event score for event i in tree m is given by:

\[ T_m(x_i) = 0.5 \times \ln\big( p_m(x_i) / (1 - p_m(x_i)) \big), \]

where $p_m(x_i)$ is the weighted purity of the leaf on which event i falls. The event weights
are updated according to:

\[ w_i \to w_i \times e^{-y_i T_m(x_i)}, \]

and then renormalized so that the total weight is one. The total output score including
all of the trees is given by $T(x) = \sum_{m=1}^{M} T_m(x)$.
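
A minimal sketch of the Real AdaBoost leaf score and weight update follows. The clipping of the purity away from exactly 0 and 1 (parameter `clip`) is an assumption added here to keep the logarithm finite; the source does not say how pure leaves are handled.

```python
import math


def leaf_score(leaf_purity, clip=1e-6):
    """T_m = 0.5 * ln(p / (1 - p)); purity is clipped to keep the log finite (assumption)."""
    p = min(max(leaf_purity, clip), 1.0 - clip)
    return 0.5 * math.log(p / (1.0 - p))


def real_adaboost_step(weights, labels, leaf_purities):
    """w_i -> w_i * exp(-y_i * T_m(x_i)), renormalized so that the total weight is one."""
    boosted = [w * math.exp(-y * leaf_score(p))
               for w, y, p in zip(weights, labels, leaf_purities)]
    total = sum(boosted)
    return [w / total for w in boosted]
```

For example, a signal and a background event of equal weight landing on the same leaf of purity 0.75 end up with weights 0.25 and 0.75: the background event sitting on a signal-like leaf is boosted.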



4 Tuning Parameters for the Boosted Decision Trees 

MiniBooNE MC samples from the February 2004 Baseline MC were used to tune some
parameters of the boosted decision trees. There are 88233 intrinsic $\nu_e$ signal events
and 162657 $\nu_\mu$ background events. 20000 signal events and 30000 background events
were selected randomly for the training sample and the rest of the events were the test
sample. The number of input variables for boosting training is 52. The relative ratio
is defined as the background efficiency divided by the corresponding signal efficiency
and rescaled by a constant value.
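
The relative ratio defined above can be computed from the boosting output scores of the test sample. The sketch below is illustrative only: it assumes unweighted test events, selects events with a score at or above a cut, and treats the rescaling constant as a free parameter `scale`, whose value is not given in the text.

```python
def efficiency_at_cut(scores, cut):
    """Fraction of events with score at or above the cut."""
    return sum(1 for s in scores if s >= cut) / len(scores)


def relative_ratio(signal_scores, background_scores, target_signal_eff, scale=1.0):
    """Background efficiency / signal efficiency at the cut giving the target signal efficiency.

    'scale' stands in for the constant rescaling factor (hypothetical value).
    """
    ranked = sorted(signal_scores, reverse=True)
    n_keep = max(1, int(round(target_signal_eff * len(ranked))))
    cut = ranked[n_keep - 1]  # score of the last signal event that is kept
    sig_eff = efficiency_at_cut(signal_scores, cut)
    bkg_eff = efficiency_at_cut(background_scores, cut)
    return scale * bkg_eff / sig_eff
```

A smaller relative ratio at a given signal efficiency means fewer background events survive the selection, so lower curves in the performance plots correspond to better separation.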

The left plot of Figure 2 shows the relative ratio versus the signal efficiency for
AdaBoost with 45 leaves per decision tree and various $\beta$ values for 1000 tree iterations.
The boosting performance depends slightly on the $\beta$ value. AdaBoost with $\beta = 1$
works slightly better in the high signal efficiency region (Eff > 65%) but worse in the
low signal efficiency region (Eff < 60%) than AdaBoost with smaller $\beta$ values, 0.8, 0.5
or 0.3. To balance the overall performance, $\beta = 0.5$ is selected to replace the standard
value 1 for the AdaBoost training.

The right plot of Figure 2 shows the relative ratio versus the signal efficiency for
AdaBoost with $\beta = 0.5$ and 1000 tree iterations for various decision tree sizes ranging
from 8 leaves to 100 leaves. AdaBoost with a large tree size worked significantly
better than AdaBoost with a small tree size of 8 leaves; the latter number has been
recommended in some statistics literature [TH]. Typically, it takes more tree iterations
for the smaller tree size to reach optimal performance. For this application, even with
more tree iterations (10000 trees), results from boosting with small tree size (8 leaves)
are still significantly worse (~10%-20%) than results obtained with large tree size (45
leaves). Here, 45 leaves per decision tree is selected (this number is quite close to the
number of input variables, 52, for the boosting training).

How many decision trees are sufficient? It depends on the MC samples for boosting 
training and testing. For the given set of boosting parameters selected above, we 
ran boosting with 1000 tree iterations. The left plot of Figure 3 shows the relative 
ratio versus the signal efficiency for AdaBoost with tree iterations of 100, 200, 500, 
800 and 1000, respectively. The boosting performance becomes better with more tree 
iterations. The right plot of Figure 3 shows the relative ratio versus the number of 
decision trees for signal efficiencies of 50%, 60% and 70% which cover the regions of 
greatest interest for the MiniBooNE experiment. Typically, the boosting performance
for low signal efficiencies converges after a few hundred tree iterations and is then stable.
For high signal efficiency, boosting performance continues to improve as the number of 
decision trees is increased. For these particular MC samples, the boosting performance 
is close to optimal after 1000 tree iterations. For the sake of comparison, the AdaBoost 
performance of the boosting training MC samples is also shown in the right plot of 
Figure 3. The relative ratios drop quickly down to zero (zero means no background 
events left after selection for a given signal efficiency) within 100 tree iterations for 
50%-70% signal efficiencies. The AdaBoost outputs for the training MC sample and
for the testing MC sample for $N_{tree}$ = 1, 100, 500 and 1000 are shown in Figure 4.
The signal and background separation for the training sample becomes better as the
number of tree iterations increases. The signal and background events are completely
distinguished after about 500 tree iterations. For the testing samples, however, the
signal and background separations are quite stable after a few hundred tree iterations.
The corresponding relative ratios are stable for given signal efficiencies, as shown in
the right plot of Figure 3.

The tuning parameter for ε-Boost is $\epsilon$. The left plot of Figure 5 shows the relative
ratio versus the signal efficiency for ε-Boost with $\epsilon$ values of 0.005, 0.01, 0.02 and 0.04,
respectively. ε-Boost with fairly large $\epsilon$ values for 45 leaves per decision tree and 1000
tree iterations has better performance for the high signal efficiency region (Eff > 50%).
The results from AdaBoost with $\beta = 0.5$ are comparable to those from ε-Boost. ε-Boost
with $\epsilon$ > 0.01 works slightly better because ε-Boost converges more quickly with larger
$\epsilon$ values. However, with more tree iterations, the final performances for different $\epsilon$
values are very comparable. Here $\epsilon = 0.01$ is chosen for further comparisons.

Figure 3: Tuning decision tree iterations using AdaBoost (left) and the relative ratio
versus the number of tree iterations for signal efficiencies ranging from 50% to 70%
(right). $\beta = 0.5$ and there are 45 leaves per decision tree.



The right plot of Figure 5 shows the relative ratio versus the signal efficiency for
AdaBoost and ε-Boost using two different ways to split tree nodes. One way is to
maximize the criterion based on the Gini index to select the next tree split; the other
way is to split the left tree node first. For AdaBoost, the performance for the "left
node first" method gets worse for signal efficiency less than about 65%. At about the
same signal efficiency, the performances of the two ε-Boost variants are quite comparable
and are comparable with AdaBoost based on the Gini index. However, the ε-Boost method
based on the Gini index becomes worse than the others for high signal efficiency.

Larger $\epsilon$ makes ε-Boost converge more quickly, but increasing the size of decision
trees also makes ε-Boost converge more quickly. The performance comparison of
AdaBoost with different tree sizes shown in the right plot of Figure 2 is for the same
number of tree iterations (1000). To make a fair comparison of the boosting performance
with different tree sizes, it is better to let them have a similar number of total
tree leaves. The top left plot of Figure 6 shows the relative ratio versus the signal
efficiency for AdaBoost and ε-Boost with similar numbers of total tree leaves: 1800
tree iterations for 45 leaves per tree and 10000 tree iterations for 8 leaves per tree. For
a small decision tree size of 8 leaves, the performance of ε-Boost is better than
that of AdaBoost for 10000 trees. For a large decision tree size of 45 leaves, ε-Boost
has slightly better performance than AdaBoost at low $\nu_e$ signal efficiency (<65%), but
worse at high $\nu_e$ signal efficiency (>70%). The comparison between small tree size
(8 leaves) and large tree size (45 leaves) with comparable overall decision tree leaves
indicates that large tree size with 45 leaves yields ~10%-20% better performance for
the MiniBooNE Monte Carlo samples.

The other five plots in Figure 6 show the relative ratio versus the number of tree
iterations for AdaBoost and ε-Boost with 45 leaves and 8 leaves assuming signal
efficiencies of 40%, 50%, 60%, 70% and 80%, respectively. The maximum number of tree
iterations is 5000 for the large tree size of 45 leaves and 10000 for the small tree size of
8 leaves. Usually, the performance of the boosting method becomes better with more
tree iterations in the beginning; then at some point, it may reach an optimal value
and gradually get worse with increasing number of trees, especially in the low signal
efficiency region. The turning point of the boosting performance depends on the signal
efficiency and on the MC samples used for training and testing.

Figure 5: Tuning ε-Boost (left) and comparing two ways to split the tree branches
(right).

Generally, if the number of weighted signal events is larger than the number of
weighted background events in a given leaf, it is called a signal leaf, otherwise, a
background leaf. Here, the threshold value for signal purity is 50% for the leaf to be
called a signal leaf. This threshold value can be modified, say, to 30%, 40%, 45%, 60%
or 70%. It is seen in Figure 7 that the performance of boosted decision trees with
AdaBoost degrades for threshold values away from the central value of 50%. Especially
for threshold values away from 50%, the $\mathrm{err}_m$ of the mth tree often converges to 0.5
within about 100 tree iterations; after that the weights of the misclassified events are
no longer updated because $\alpha_m = \beta \times \ln((1 - \mathrm{err}_m)/\mathrm{err}_m) = 0$ if $\mathrm{err}_m = 0.5$. Then
$w_i \to w_i \times e^{\alpha_m I(y_i \ne T_m(x_i))}$ leaves each weight the same as for the previous tree.
Typically, the $\mathrm{err}_m$ value increases for the first 100-200 tree iterations and then
remains stable for further tree iterations, causing the weight of the mth tree, $\alpha_m$, to
decrease for the first 100-200 tree iterations and then remain stable. For practical use
of the AdaBoost algorithm, a lower limit, say, 0.01, on $\alpha_m$ will avoid the impotence
of the succeeding boosted decision trees.
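
The effect of a lower limit on the tree weight can be sketched as follows. The function name `tree_weight` and its default arguments are illustrative; only the value 0.01 for the lower limit and the formula for the tree weight come from the text.

```python
import math


def tree_weight(err, beta=0.5, alpha_min=0.01):
    """alpha_m = beta * ln((1 - err_m) / err_m), floored at alpha_min.

    Without the floor, err_m = 0.5 gives alpha_m = 0, the boost factor
    e^(alpha_m) becomes 1, and the event weights stop changing.
    """
    alpha = beta * math.log((1.0 - err) / err)
    return max(alpha, alpha_min)


def boost_factor(err, beta=0.5, alpha_min=0.01):
    """Factor applied to the weight of each misclassified event."""
    return math.exp(tree_weight(err, beta, alpha_min))
```

With the floor in place, a tree with a weighted error rate of exactly 0.5 still boosts misclassified events by a factor of e^0.01, about 1%, so the event weights keep evolving from one tree to the next.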

This problem is unlikely to happen for ε-Boost because the weights of misclassified
events are always updated by the same factor, $e^{2 \epsilon I(y_i \ne T_m(x_i))}$. If differing purity
threshold values are applied to boosted decision trees with ε-Boost, the performance
peaks around 50% and slightly worsens, typically within 5%, for other values ranging
from 30% to 70%.

Figure 6: Boostings with different tree sizes.

Figure 7: Performance of AdaBoost with $\beta = 0.5$, 45 leaves per tree and 1000 tree
iterations for various threshold values for signal purity to determine whether the tree
leaf is a signal leaf or a background leaf.

The unweighted misclassified event rate, the weighted misclassified event rate $\mathrm{err}_m$ and
$\alpha_m$ for the boosted decision trees with the AdaBoost algorithm versus the number of
tree iterations are shown in Figure 8, for a signal purity threshold value of 50%. From
this plot, it is clear that, after a few hundred tree iterations, an individual boosted
decision tree has a very weak discriminant power (i.e., is a "weak" classifier). The
$\mathrm{err}_m$ is about 0.4-0.45, corresponding to $\alpha_m$ of around 0.2-0.1. The unweighted event
discrimination of an individual tree is even worse, as is also seen in Figure 8. Boosted
decision trees focus on the misclassified events which usually have high weights after
hundreds of tree iterations. The advantage of the boosted decision trees is that the
method combines all decision trees, "weak" classifiers, to make a powerful classifier as
stated in the Introduction section.
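
As a check of the quoted correspondence between the weighted error rate and the tree weight, the AdaBoost formula of Section 3.1 can be evaluated directly, assuming the tuned value $\beta = 0.5$:

```latex
\alpha_m = \beta \ln\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}
\quad\Rightarrow\quad
\begin{cases}
\mathrm{err}_m = 0.40: & \alpha_m = 0.5 \ln(0.60/0.40) \approx 0.20, \\
\mathrm{err}_m = 0.45: & \alpha_m = 0.5 \ln(0.55/0.45) \approx 0.10.
\end{cases}
```

Both values agree with the range of about 0.2 to 0.1 read from the plot.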

When the weights of misclassified events are increased (boosted), some events which
are very difficult to classify correctly obtain large event weights. In principle, some
outliers which have large event weights may degrade the boosting performance. To avoid this
effect, it might be useful to set an upper limit for the event weights to trim some outliers.
It is found that setting a weight limit does not improve the boosting performance, and, in
fact, may degrade the boosting performance slightly. However, the effect was observed
to be within one standard deviation for the statistical error. One might also trim events
with very low weights, which can be correctly classified easily, to provide a better chance
for difficult events. No apparent improvement or degradation was observed considering
the statistical error. These results may indicate that the boosted decision trees have
the ability to deal with outliers quite well and to focus on the events located around the
boundary regions where it is difficult to correctly distinguish signal and background
events.

Figure 8: The unweighted, weighted misclassified event rate ($\mathrm{err}_m$), $\alpha_m$ versus the
number of tree iterations for AdaBoost with $\beta = 0.5$ and signal purity threshold value
of 50%.

5 Comparison Among Boosting Algorithms 

Besides AdaBoost and ε-Boost, there are other algorithms such as ε-LogitBoost and
ε-HingeBoost which use different ways of updating the event weights for the misclassified
events. The four plots of Figure 9 show the relative ratio versus the signal efficiency
for various boostings with different tree sizes. The top left, top right, bottom left,
and bottom right plots are for 500, 1000, 2000, and 3000 tree iterations, respectively.
Boosting with a large tree size of 45 leaves is seen to work better than boosting with
a small tree size of 8 leaves as noted above. AdaBoost and ε-Boost have comparable
performance, slightly better than that of ε-LogitBoost. ε-HingeBoost is the worst
among these four boosting algorithms, especially for the low signal efficiency region.

The top left, top right, bottom left and bottom right plots of Figure 10 show the
relative ratio versus the signal efficiency with 45 leaves of ε-Boost, AdaBoost, ε-LogitBoost
and ε-HingeBoost for varying numbers of tree iterations. Generally, boosting performance
continuously improves with an increase in the number of tree iterations until
an optimum point is reached. From the two top plots, it is apparent that ε-Boost
converges more slowly than does AdaBoost; however, with about 1000 tree iterations,
their performances are very comparable. There is only marginal improvement beyond
1000 tree iterations for high signal efficiency, and the performance may get worse for
the low signal efficiency region if the boosting is over-trained (goes beyond the optimal
performance range). Similar plots for the four boosting algorithms with 8 leaves per
decision tree are shown in Figure 11. Results for ε-HingeBoost with 30 and 8 tree
leaves are shown in the bottom right plots of Figures 10 and 11. The performance for
200 tree iterations seems worse than that for 100 tree iterations. This may indicate
that its performance is unstable in the first few hundred tree iterations, but works well
after about 500 tree iterations. However, the overall performance of ε-HingeBoost is
the worst among the four boosting algorithms described above.

Figure 9: Performance comparison of various Boostings.

Figure 10: Performance comparison of various Boostings with large tree size.

Figure 11: Performance comparison of various Boostings with small tree size of 8 leaves
per tree.

For some purposes, LogitBoost has been found to be superior to other algorithms [17].
For the MiniBooNE data, it was found to have about 10%-20% worse background
contamination for a fixed signal efficiency than the regular AdaBoost. LogitBoost
converged very rapidly, after fewer than 200 trees, and the contamination ratio got worse
past that point. A modification of LogitBoost was tried in which the convergence was
slowed by taking $T(x) = \sum_{m=1}^{M} \epsilon \times \frac{1}{2} T^*_m(x)$, the extra factor of $\epsilon = 0.1$ slowing the
weighting update rate. This indeed improved the performance considerably, but the
results were still slightly worse than those obtained with AdaBoost or ε-Boost for a tree
size of 45 leaves. The convergence to an optimum point still took fewer than 300 trees,
which was less than the number needed with AdaBoost or ε-Boost.

Gentle AdaBoost and Real AdaBoost were also tried; both of them were found to be
slightly worse than the discrete AdaBoost. Relative error ratios versus signal efficiency
for various boosting algorithms are listed in Table 1.

6 Comparison of AdaBoost and Random Forests 

The random forests method is another algorithm which uses a "majority vote" to improve
the stability of the decision trees. The training events are selected randomly with or
without replacement. Typically, one half or one third of the training events are selected
for each decision tree training. The input variables can also be selected randomly for
determining the tree splitters. There is no event weight update for the misclassified
events. For the AdaBoost algorithm, each tree is built using the results of the previous
tree; for the random forests algorithm, each tree is independent of the other trees.
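
The random selection of training events and input variables, and the final vote, can be sketched with the Python standard library. The function names, the seeded `random.Random` generator and the tie-breaking rule (ties count as background) are assumptions of this sketch.

```python
import random


def draw_training_subset(n_events, fraction=0.5, replace=False, rng=None):
    """Indices of the events used to train one tree (with or without replacement)."""
    rng = rng or random.Random()
    n_draw = int(n_events * fraction)
    if replace:
        return [rng.randrange(n_events) for _ in range(n_draw)]
    return rng.sample(range(n_events), n_draw)


def draw_split_variables(n_variables, fraction, rng=None):
    """Indices of the input variables considered at one tree split."""
    rng = rng or random.Random()
    n_draw = max(1, int(round(n_variables * fraction)))
    return sorted(rng.sample(range(n_variables), n_draw))


def majority_vote(tree_outputs):
    """Unweighted vote over +1/-1 tree outputs; ties go to background (assumption)."""
    return 1 if sum(tree_outputs) > 0 else -1
```

Because every tree draws its own subset and no event weights are carried from one tree to the next, the trees can be trained independently of one another, in contrast to the sequential dependence in AdaBoost.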

Figure 12 shows a comparison between random forests of different tree sizes and
AdaBoost, both with 1000 tree iterations. Large tree size is preferred for the random
forests (the original random forests method lets each tree develop fully until all tree
leaves are pure signal or background). In this study a fixed number of tree leaves was
used. The performance of the random forests algorithm with 200 or 400 tree leaves is
about equal. Compared with AdaBoost, the performance of the random forests method
is significantly worse. The main reason for the inefficiency is that there is no event
weight update for the misclassified events. One of the main advantages of the boosting
algorithm is that the weights of misclassified events are boosted, which makes it possible
for them to be correctly classified in succeeding tree iterations.

Considering this advantage, an event weight update algorithm (AdaBoost) was used
to boost the random forests. The performances of the boosted random forests algorithm
are then significantly better than those of the original random forests, as can be seen
in Figure 12. The performance of AdaBoost with 100 leaves per decision tree is
slightly better than that of the boosted random forests. Other tests were made using
one half of the training events selected randomly for each tree together with 30%, 50%, 80%
or 100% of the input variables selected randomly for each tree split. The performances
of the boosted random forests method using the AdaBoost algorithm are very stable.

Table 1: Relative error ratio versus signal efficiency for various boosting algorithms for
MiniBooNE data. Differences up to about 0.03 are largely statistical. b=0.5 means
smooth scoring function described in Section 9.

Figure 12: Performance comparison of AdaBoost, random forests and boosted random
forests with 1000 tree iterations.

The boosted random forests method uses only one half or one third of the training events,
selected randomly for each tree, and also only a fraction of the input variables, selected
randomly for each tree split. This method has the advantage that it can run faster
than regular AdaBoost while providing similar performance. In addition, it may also
help to avoid over-training since only a random part of the training events is used
for each decision tree.

7 Post-fitting of the Boosted Decision Trees 

Some recent papers [18, 19, 20] indicate that post-fitting of the trained boosted decision
trees may help to make further improvement. One possibility is that a selected ensemble
of many decision trees could be better than the ensemble of all trees. Here post-fitting
of the weights of decision trees was tried. The basic idea is to optimize the boosting
performance by retuning the weights of the decision trees or even removing some of
them by setting them to have zero weight. A Genetic Algorithm [21, 22] is used to optimize
the weights of all trained decision trees.

A new MC sample is used for this purpose. The MC sample is split into three
subsamples, mc1, mc2 and mc3, each subsample having about 26700 signal events and
21000 background events. The subsample mc1 is used to train AdaBoost with 1000 decision
trees. The background efficiencies for mc1, mc2 and mc3 for a signal efficiency of 60% are
0.12%, 5.15% and 4.94%, respectively. If mc1 is used for post-fitting, then the corresponding
background efficiency can be driven down to 0.05%, but the background efficiency for
test sample mc3 is about 5.5%; it has become worse after post-fitting. It seems that
it is not good to use the same sample for the boosting training and post-fitting. If mc2 is
used for post-fitting, then the background efficiency goes down to 4.21% for mc2
and 4.76% for the testing sample mc3. The relative improvement is about 3.6% and the
statistical error for the background events is about 3.2%. Suppose the MC samples for
post-fitting and testing are exchanged, so that mc3 is used for post-fitting while mc2 is
used for testing. The background efficiency is then 4.38% for the post-fitting sample mc3
and 5.06% for the testing sample mc2. The relative improvement is about 1.5%.

A second post-fitting program was tried, the Pathseeker program of J.H. Friedman
and B.E. Popescu [19, 20], a robust regularized linear regression and classification
method. This program produced no overall improvement, with perhaps a marginal 4%
improvement for 50% signal efficiency. Based on our studies, it seems that post-fitting
makes only a marginal improvement.

8 How to Select Input Variables 

One of the major advantages of the boosted decision tree algorithm is that it can 
handle large numbers of input variables as was pointed out previously [6]. Generally 
speaking, more input variables cover more information which may help to improve 
signal and background event separation. Often one can reconstruct several hundreds 
or even thousands of variables which have some discriminant power to separate signal 
and background events. Some of them are superior to others, and some variables may 
have correlations with others. Too many variables, some of which are "noise" variables, 
won't improve but may degrade the boosting performance. It is useful to select the most 
useful variables for boosting training to maximize the performance. New MC samples 
were generated with 182 reconstructed variables. In order to select the most powerful 
variables, all 182 variables were used as input to boosted decision trees running 150 
tree iterations. Then the effectiveness of the input variables was rated based on how 
many times each variable was used as a tree splitter. The first variable in the sorted 
list was regarded as the most useful variable for boosting training. The first 100 sorted 
input variables were selected to train AdaBoost with $\beta = 0.5$, 45 leaves per decision
tree and 1000 tree iterations. The dependence of the number of times a variable is
used as a tree splitter on the number of tree iterations is shown for some selected
input variables (variables number 1, 5, 10, 20, 50, 80 and 100) in the top left plot of
Figure 13 with linear scale and in the top right plot with log scale.

Figure 13: Top plots show the dependence of the number of times a variable is used as
a tree splitter versus the number of tree iterations for some selected input variables. The
top left plot is linear scale and the top right plot is log scale. The number of times a
variable is used as a tree splitter versus all input variables with 1000 tree iterations
is shown in the bottom left plot. The bottom right plot shows the corresponding Gini
index contribution for all input variables.

In this way, the first 30, 40, 60, 80, 100, 120 and 140 input variables were selected 
from the sorted list to train boosted decision trees with 1000 tree iterations. A com- 
parison of their performance is shown in the left plot of Figure 14. The boosting per- 
formance steadily improves with more input variables until about 100 to 120. Adding 
further input variables doesn't improve and may degrade the boosting performance. 
The main reason for the degradation is that there is no further useful information in 
the additional input variables and these variables can be treated as "noise" variables 
for the boosting training. However, if the additional variables include some new infor- 
mation which is not included in the other variables, they should help to improve the 
boosting performance. 

So far only one way to sort the input variables has been described. Some other ways 
can also be used and work reasonably well as shown in the right plot of Figure 14. List1
means the input variables are sorted based on how many times they were used as
tree splitters for 150 tree iterations, list2 means the input variables are sorted based
on their Gini index contributions for 150 tree iterations, and list3 means the variables
are sorted according to which variables are used earlier than others as tree splitters for
150 tree iterations. List1001, list1002 and list1003 are similar to list1, list2 and list3,
but use 1000 tree iterations. The first 100 input variables from the sorted lists are used
for boosting training with 1000 tree iterations. The performances are comparable for
100 input variables sorted in different ways. However, the boosting performances for
list1 and list3 are slightly better than the others.
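
The three sorting criteria described above (number of times used as a splitter, summed Gini index contribution, and earliest use as a splitter) can be illustrated with a minimal sketch. It assumes that each tree split has been recorded as a tuple (tree index, variable index, Gini contribution) in the order the splits were made; this bookkeeping format and the function name rank_variables are hypothetical and are not taken from the analysis code used for this study.

```python
from collections import defaultdict


def rank_variables(splits, n_vars):
    """Return three orderings of the variable indices, most useful first.

    splits: iterable of (tree_index, variable, gini_gain) tuples, one per
            tree split, in the order the splits were made. This record
            format is an assumption made for illustration only.
    n_vars: total number of input variables.
    """
    count = defaultdict(int)    # times each variable was used as a splitter
    gini = defaultdict(float)   # summed Gini index contribution per variable
    first_use = {}              # position of the first split on each variable

    for position, (_tree, var, gain) in enumerate(splits):
        count[var] += 1
        gini[var] += gain
        first_use.setdefault(var, position)

    variables = list(range(n_vars))
    # list1: most frequently used splitters first
    list1 = sorted(variables, key=lambda v: -count[v])
    # list2: largest summed Gini contribution first
    list2 = sorted(variables, key=lambda v: -gini[v])
    # list3: variables used earliest first; unused variables go last
    list3 = sorted(variables, key=lambda v: first_use.get(v, float("inf")))
    return list1, list2, list3
```

The first N entries of any of the returned lists (for example list1[:100]) would then be taken as the input variables for the full boosting training. Ties are broken by variable index, which is an arbitrary choice of this sketch.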

If 100 input variables are selected from each list, the number
of variables which overlap typically varies from about 70 to 90 for the different lists. In
other words, about 10 to 30 input variables are different among the various lists. In spite 
of these differences, the boosting performances are still comparable and stable. Further 
studies with MC samples generated using varied MC input parameters corresponding 
to systematic errors show that the boosting outputs are very stable even though some 
input variables vary quite a lot. If these same varied MC samples are applied to the 
ANNs, it turns out that boosted decision trees work significantly better than the ANNs 
for both event separation performance and for stability. 
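
The overlap between the selected variable sets can be counted with a few lines of code. The sketch below assumes that each sorted list is available as a Python list of variable indices, keyed by a name such as "list1"; the function names and the dictionary layout are illustrative assumptions.

```python
from itertools import combinations


def top_n_overlap(list_a, list_b, n=100):
    """Number of variables common to the first n entries of two sorted lists."""
    return len(set(list_a[:n]) & set(list_b[:n]))


def pairwise_overlaps(ranked_lists, n=100):
    """Overlap counts for every pair of sorted lists.

    ranked_lists: dict mapping a list name to its sorted variable indices
                  (hypothetical layout, for illustration only).
    """
    return {
        (a, b): top_n_overlap(ranked_lists[a], ranked_lists[b], n)
        for a, b in combinations(sorted(ranked_lists), 2)
    }
```

With n = 100, an overlap of 70 to 90 corresponds to 10 to 30 variables that differ between two lists.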



9 Tests of the Scoring Function 

In the standard boost, the score for an event from an individual tree is a simple square 
wave depending on the purity of the leaf on which the event lands. If the purity is 
greater than 0.5, the score is 1 and otherwise it is -1.

One can ask whether a smoother function of the purity might be more appropriate. 
If the purity of a leaf is 0.51, should the score be the same as if the purity were 0.99? 
Two possible alternative scores were tested. Let $z = 2 \times \mathrm{purity} - 1$.

A. $\mathrm{score} = -1 + \frac{2}{e^{-az} + 1}$,

B. $\mathrm{score} = \mathrm{sign}(z) \times |z|^{b}$,

where a and b are parameters.

Figure 14: Left: boosting performances with different numbers of input variables. Right:
boosting performances with the 100 input variables selected in different ways.
In all cases, 1000 tree iterations are used.

Figure 15: Performance of AdaBoost with β = 0.5, 45 leaves per tree, and purity
threshold value 0.5 for various values of the parameters a (left) and b (right).

Figure 16: Performance of AdaBoost with b = 0 (step function) and b = 0.5 (smooth
function), β = 0.5, 45 leaves per tree, versus the number of tree iterations.

Tests were run for various parameter values for scores A and B and compared 
with the standard step function. Performance comparisons of AdaBoost for various
values of the parameters a (left) and b (right) are shown in Figure 15.
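
The standard step score and the two alternative scores A and B defined above can be written out as a short sketch. The function names are chosen here for illustration. Note that score A is algebraically equal to tanh(az/2), and that score B with b = 0 reduces to the step function for any purity other than exactly 0.5.

```python
import math


def score_step(purity):
    """Standard score: +1 if the leaf purity exceeds 0.5, otherwise -1."""
    return 1.0 if purity > 0.5 else -1.0


def score_a(purity, a):
    """Score A: -1 + 2 / (exp(-a z) + 1), with z = 2 * purity - 1."""
    z = 2.0 * purity - 1.0
    return -1.0 + 2.0 / (math.exp(-a * z) + 1.0)


def score_b(purity, b):
    """Score B: sign(z) * |z|**b, with z = 2 * purity - 1."""
    z = 2.0 * purity - 1.0
    if z == 0.0:
        # sign(0) is taken to be 0 here; the step score returns -1 instead.
        return 0.0
    return math.copysign(abs(z) ** b, z)
```

With b = 0.5, a leaf of purity 0.51 receives a much smaller score than a leaf of purity 0.99, whereas the step function gives both leaves a score of 1.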

For a smooth function with b = 0.5, boosting performance converges faster than 
the original AdaBoost algorithm for the first few hundred decision trees, as shown in 
Figure 16. However, no evidence was found that the optimum was reached any sooner 
by the smooth function. The faster initial convergence arises because the smooth function of the purity describes
the probability of a given event to be signal or background in more detail than the step 
function used in the original AdaBoost algorithm. With an increase in the number 
of tree iterations, however, the "majority vote" plays the most important role for the 
event separation. The ultimate performance of the smooth function with b = 0.5 is 
comparable to the performance of the standard AdaBoost. 

10 Some Miscellaneous Tests

In MiniBooNE, one is trying to improve the signal to background ratio by more than 
a factor of 100. One might expect that one should start by giving the background a 
greater total weight than the signal. In fact, giving the background two to five times 
the weight of the signal slightly degraded the performance. Giving the background 0.5 
to 0.2 of the weight of the signal gave the same performance as equal initial weights. 
For one set of Monte Carlo runs the PID variables were carefully modified to be
flat as functions of the energy of the event and the event location within the detector.
This decreased the correlations between the PID variables. The performance of these
corrected variables was compared with the performance of the uncorrected variables.
As expected, the convergence was much faster at first for the corrected variable boost.
However, as the number of trees increased, the performance of the uncorrected variable
boost caught up with the other. For 1000 trees, the performance of the two boost tests
was about the same. Over the long run, boosting is able to compensate for correlations and
dependencies, but the number of trees needed for convergence can be considerably reduced
by making the PID variables independent.

Figure 17: Performance of AdaBoost with β = 0.5, 45 leaves per tree and 1000 tree
iterations, using various numbers of background events for training. 20,000 signal events
were used for the training.
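
The initial-weight test described at the start of this section can be sketched as follows. The sketch assumes that events are labelled +1 for signal and -1 for background, that the weights within each class are equal, and that the weights are normalized to a total of 1. These conventions and the function name are assumptions made for illustration and are not necessarily those of the code used for this study.

```python
def initial_weights(labels, background_to_signal=1.0):
    """Initial event weights for boosting.

    labels: sequence of +1 (signal) or -1 (background); both classes
            must be present.
    background_to_signal: ratio of the total background weight to the
            total signal weight (1.0 gives equal total weights).
    Returns a list of weights that sums to 1.
    """
    n_sig = sum(1 for y in labels if y == 1)
    n_bkg = len(labels) - n_sig
    total_sig = 1.0 / (1.0 + background_to_signal)
    total_bkg = background_to_signal / (1.0 + background_to_signal)
    return [total_sig / n_sig if y == 1 else total_bkg / n_bkg for y in labels]
```

In this convention, the tests quoted above correspond to background_to_signal values between 2 and 5 (slightly degraded performance) and between 0.2 and 0.5 (same performance as equal weights).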

The number of MC events used to train the boosting effectively is also an important
issue that we have investigated. Generally, more training events are preferred, but
it is impractical to generate unlimited MC events for training. The performance of
AdaBoost with 1000 tree iterations and 45 leaves per tree, using various numbers of
background events ranging from 10000 to 60000 for training, is shown in Figure 17,
where the number of signal events is fixed at 20000. For the MiniBooNE data, the use
of 30000 or more background events works fairly well; fewer background events for
training degrade the performance.






11 Conclusions 



PID input variables obtained using the event reconstruction programs for the Mini- 
BooNE experiment were used to train boosted decision trees for signal and background 
event separation. Numerous trials were made to tune the boosted decision trees. Based 
on the performance comparison of various algorithms, decision trees with the AdaBoost
or the ε-Boost algorithms are superior to the others. The major advantages of boosted
decision trees include their stability, their ability to handle large numbers of input variables,
and their use of boosted weights for misclassified events to give these events a
better chance to be correctly classified in succeeding trees.

Boosting is a rugged classification method. If one provides sufficient training variables
and sufficient leaves for the tree, it appears that it will, eventually, converge to
close to an optimum value. This assumes that ε for ε-Boost or β for AdaBoost is not
set too large. There are modifications of the basic boosting procedure which can speed
up the convergence. Use of a smooth scoring function improves initial convergence. In 
the last section, it was seen that removing correlations of the input PID variables im- 
proved convergence speed. For some applications, the use of a boosted natural forests 
technique may also speed the convergence. 

For a large set of discriminant variables, several techniques can be used to select a 
set of powerful input variables to use for training boosted decision trees. Post-fitting of 
the boosted decision trees makes only a marginal improvement in the tests presented 
here. 